Sequence Factorization with Multiple References
Abstract
The success of high-throughput sequencing has led to an increasing number of projects that sequence large populations of a species. Storing and analyzing the sequence data is a key challenge in these projects because of the sheer size of the datasets. Compression is one simple technology for dealing with this challenge. Referential factorization and compression schemes, which store only the differences between an input sequence and a reference sequence, have attracted considerable interest in this field. Highly similar sequences, e.g., human genomes, can be compressed at a ratio of 1,000:1 or more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that compressing against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., with respect to main memory consumption and optimality. In this paper, we describe one key technique for referential compression against multiple references: the factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings that greatly influence (1) the size of the factorization, (2) the time for factorization, and (3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speeds (0.01 MB/s to more than 600 MB/s), and main memory usage (a few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.
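To make the idea of factorizing a sequence against several references concrete, the following is a minimal sketch of a naive greedy scheme. It is not the algorithm evaluated in the paper: the function names, the factor representation, and the `min_len` threshold are all assumptions made for illustration, and the quadratic string search stands in for the index structures a real implementation would likely use.

```python
from typing import List, Tuple, Union

# Hypothetical factor representation (an assumption, not the paper's format):
# a match is (reference index, start offset, length); a literal is one character.
Factor = Union[Tuple[int, int, int], str]


def longest_match(refs: List[str], seq: str, pos: int) -> Tuple[int, int, int]:
    """Return (ref index, start, length) of the longest prefix of seq[pos:]
    that occurs in any reference; length is 0 if nothing matches."""
    best = (-1, -1, 0)
    for r, ref in enumerate(refs):
        length = best[2] + 1  # only strictly longer matches can improve on best
        start = 0
        while pos + length <= len(seq):
            # A longer prefix can only occur at or after the first
            # occurrence of the shorter prefix, so resume from `start`.
            start = ref.find(seq[pos:pos + length], start)
            if start == -1:
                break
            best = (r, start, length)
            length += 1
    return best


def factorize(refs: List[str], seq: str, min_len: int = 2) -> List[Factor]:
    """Greedy left-to-right factorization; matches shorter than min_len
    (an assumed threshold) are emitted as literal characters instead."""
    factors: List[Factor] = []
    pos = 0
    while pos < len(seq):
        r, start, length = longest_match(refs, seq, pos)
        if length >= min_len:
            factors.append((r, start, length))
            pos += length
        else:
            factors.append(seq[pos])
            pos += 1
    return factors


def reconstruct(refs: List[str], factors: List[Factor]) -> str:
    """Invert factorize: expand each match from its reference."""
    parts = []
    for f in factors:
        if isinstance(f, str):
            parts.append(f)
        else:
            r, start, length = f
            parts.append(refs[r][start:start + length])
    return "".join(parts)


refs = ["ACGTACGT", "TTGACCA"]   # toy references, invented for this example
seq = "ACGTTTGACCAX"
multi = factorize(refs, seq)     # uses both references
single = factorize(refs[:1], seq)  # uses only the first reference
print(len(multi), len(single))
```

On this toy input the two-reference run needs fewer factors than the single-reference run, which illustrates on a very small scale why additional references can shrink a factorization. A greedy longest-match choice is not guaranteed to be optimal in general.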
Similar resources
From torsion theories to closure operators and factorization systems
Torsion theories are here extended to categories equipped with an ideal of 'null morphisms', or equivalently a full subcategory of 'null objects'. Instances of this extension include closure operators viewed as generalised torsion theories in a 'category of pairs', and factorization systems viewed as torsion theories in a category of morphisms. The first point has essentially been treated in [15].
Riordan group approaches in matrix factorizations
In this paper, we consider an arbitrary binary polynomial sequence {A_n} and then give a lower triangular matrix representation of this sequence. As the main result, we obtain a factorization of the infinite generalized Pascal matrix in terms of this new matrix, using a Riordan group approach. Further, some interesting results and applications are derived.
Supporting Information for “Evolutionary profiles from the QR factorization of multiple sequence alignments”
Position-dependent motif characterization using non-negative matrix factorization
MOTIVATION: Cis-acting regulatory elements are frequently constrained by both sequence content and positioning relative to a functional site, such as a splice or polyadenylation site. We describe an approach to regulatory motif analysis based on non-negative matrix factorization (NMF). Whereas existing pattern recognition algorithms commonly focus primarily on sequence content, our method simult...
Sensitive pattern discovery with 'fuzzy' alignments of distantly related proteins
MOTIVATION: Evolutionary comparison leads to efficient functional characterisation of hypothetical proteins. Here, our goal is to map specific sequence patterns to putative functional classes. The evolutionary signal stands out most clearly in a maximally diverse set of homologues. This diversity, however, leads to a number of technical difficulties. The targeted patterns-as gleaned from structu...